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Abstract. Generalized sparse matrix- matrix multiplication (or SpGEMM) is a key primitive for 
many high performance graph algorithms as well as for some linear solvers, such as algebraic multi- 
grid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing 
in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible 
and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block 
data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory- 
efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded 
number of processors; our experiments show scaling up to thousands of processors in a variety of test 
scenarios. 
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1. Introduction. We describe scalable parallel implementations of two sparse 
matrix kernels. The first, SpGEMM, computes the product of two sparse matrices 
over a general semiring. The second, SpRef, performs generalized indexing into a 
sparse matrix: Given vectors I and J of row and column indices, SpRef extracts the 
submatrix A(l, J). Our novel approach to SpRef uses SpGEMM as its key subroutine, 
which regularizes the computation and data access patterns; conversely, applying 
SpGEMM to SpRef emphasizes the importance of an SpGEMM implementation that 
handles arbitrary matrix shapes and sparsity patterns, and a complexity analysis that 
applies to the general case. 

Our main contributions in this paper are: first, we show that SpGEMM leads 
to a simple and efficient implementation of SpRef; second, we describe a distributed- 
memory implementation of SpGEMM that is more general in application and more 
flexible in processor layout than before; and, third, we report on extensive experiments 
with the performance of SpGEMM and SpRef. We also describe an algorithm for 
sparse matrix assignment (SpAsgn), and report its parallel performance. The SpAsgn 
operation, formally A(l, J) = B, assigns a sparse matrix to a submatrix of another 
sparse matrix. It can be used to perform streaming batch updates to a graph. 

Parallel algorithms for SpGEMM and SpRef, as well as their theoretical perfor- 
mance, are described in Sections [3] and [4] We present the general SpGEMM algo- 
rithm and its parallel complexity before SpRef since the latter uses SpGEMM as 
a subroutine and its analysis uses results from the SpGEMM analysis. Section |3.1| 
summarizes our earlier results on the complexity of various SpGEMM algorithms on 
distributed memory. Section 3.3 presents our algorithm of choice, Sparse SUMMA, 
in a more formal way than before, including a pseudocode general enough to handle 
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different blocking parameters, rectangular matrices, and rectangular processor grids. 
The reader interested only in parallel SpRef can skip these sections and go directly 
to Section |4j where we describe our SpRef algorithm, its novel parallelization and its 
analysis. Section [5] gives an extensive performance evaluation of these two primitives 
using large scale parallel experiments, including a performance comparison with sim- 
ilar primitives from the Trilinos package. Various implementation decisions and their 
effects on performance are also detailed. 

2. Notation. Let A 6 § mx ™ b e a sparse rectangular matrix of elements from a 
semiring S. We use nnz(A) to denote the number of nonzero elements in A. When 
the matrix is clear from context, we drop the parenthesis and simply use nnz. For 
sparse matrix indexing, we use the convenient Matlab colon notation, where A(:,i) 
denotes the ith column, A(i, :) denotes the ith row, and A(i, j) denotes the element 
at the (i, j)th position of matrix A. Array and vector indices are 1-based throughout 
this paper. The length of an array I, denoted by len(\), is equal to its number of 
elements. For one-dimensional arrays, l(z) denotes the ith component of the array. 
We use flops(A • B), pronounced "flops", to denote the number of nonzero arithmetic 
operations required when computing the product of matrices A and B. Since the 
flops required to form the matrix triple product differ depending on the order of 
multiplication, flops((AB) • C) and flops(A • (BC)) mean different things. The former 
is the flops needed to multiply the product AB with C, where the latter is the flops 
needed to multiply A with the product BC. When the operation and the operands are 
clear from context, we simply use flops. The Matlab sparse(i, j ,v,m,n) function, 
which is used in some of the pseudocode, creates an m x n sparse matrix A with 
nonzeros A(i(fe), j(k)) = v(k). 

In our analyses of parallel running time, the latency of sending a message over 
the communication interconnect is a, and the inverse bandwidth is j3, both expressed 
in terms of time for a floating-point operation (also accounting for the cost of cache 
misses and memory indirections associated with that floating point operation). f(x) = 
@(g(x)) means that / is bounded asymptotically by g both above and below. 

3. Sparse matrix-matrix multiplication. SpGEMM is a building block for 
many high-performance graph algorithms, including graph contraction [25], breadth- 
first search from multiple source vertices |10j . peer pressure clustering [34], recursive 
all-pairs shortest-paths [TU] , matching [33J , and cycle detection [35] . It is a subroutine 
in more traditional scientific computing applications such as multigrid interpolation 
and restriction [5] and Schur complement methods in hybrid linear solvers |37j . It also 
has applications in general computing, including parsing context-free languages !_32 t 
and colored intersection searching [29] . 

The classical serial SpGEMM algorithm for general sparse matrices was first de- 
scribed by Gustavson [5^ , and was subsequently used in Matlab [23] and CSparse [2D] . 



That algorithm, shown in Figure 3.1 runs in 0(flops + nnz +n) time, which is opti- 
mal for flops > max{ nnz, n}. It uses the popular compressed sparse column (CSC) 
format for representing its sparse matrices. Algorithm [I] gives the pseudocode for this 
column-wise serial algorithm for SpGEMM. 

3.1. Distributed memory SpGEMM. The first question for distributed mem- 
ory algorithms is "where is the data?". In parallel SpGEMM, we consider two ways 
of distributing data to processors. In ID algorithms, each processor stores a block 
of m/p rows of an m-by-n sparse matrix. In 2D algorithms, processors are logically 
organized as a rectangular p — p r x p c grid, so that a typical processor is named 
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Fig. 3.1: Multiplication of sparse matrices stored by columns [IT]. Columns of A 
are accumulated as specified by the non-zero entries in a column of B using a sparse 
accumulator or SPA [21] . The contents of the SPA are stored into a column of C once 
all required columns are accumulated. 



Algorithm 1 Column-wise formulation of serial matrix multiplication 
l: procedure Columnwise-SpGEMM(A, B, C) 
2: for j <— 1 to n do 
3: for k where B(/c,j) ^ do 

4: C(:,j)^C(:,j) + A(:,fe)-B(fc,j) 



P(i,j). Submatrices are assigned to processors according to a 2D block decomposi- 
tion: processor P(i, j) stores the submatrix Ajj of dimensions (m/p r ) x (n/p c ) in its 
local memory. We extend the colon notation to slices of submatrices: A^ : denotes the 
(m/p r ) x n slice of A collectively owned by all the processors along the ith processor 
row and A : j denotes the m x (n/p c ) slice of A collectively owned by all the processors 
along the jth processor column. 

We have previously shown that known ID SpGEMM algorithms are not scalable 
to thousands of processors [7J> while 2D algorithms can potentially speed up indefi- 
nitely albeit with decreasing efficiency. There are two reasons that the ID algorithms 
do not scale: First, their auxiliary data structures cannot be loaded and unloaded 
fast enough to amortize their costs. This loading and unloading is necessary because 
the ID algorithms proceed in stages in which only one processor broadcasts its sub- 
matrix to the others, in order to avoid running out of memory. Second, and more 
fundamentally, the communication costs of ID algorithms are not scalable regardless 
of data structures. Each processor receives nnz (of either A or B) data in the worst 
case, which implies that communication cost is on the same order as computation, 
prohibiting speedup beyond a fixed number of processors. This leaves us with 2D 
algorithms for a scalable solution. 

Our previous work [5] shows that the standard compressed column or row (CSC 
or CSR) data structures are too wasteful for storing the local submatrices arising 
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CP= 1 33333 3 455 

4- 4* 4- 

IR = 6 8 4 2 

NUM = 0.1 0.2 0.3 0.4 

Fig. 3.2: Matrix A in CSC format 



from a 2D decomposition. This is because the local submatrices are hypersparse, 
meaning that the ratio of nonzeros to dimension is asymptotically zero. The total 
memory across all processors for CSC format would be 0(n,y/p + nnz), as opposed to 
0(n + nnz) memory to store the whole matrix in CSC on a single processor. Thus a 
scalable parallel 2D data structure must respect hypersparsity. 

Similarly, any algorithm whose complexity depends on matrix dimension, such 
as Gustavson's serial SpGEMM algorithm, is asymptotically too wasteful to be used 
as a computational kernel for multiplying the hypersparse submatrices. Our Hy- 
perSparseGEMM OH], on the other hand, operates on the strictly O(nnz) doubly 
compressed sparse column (DCSC) data structure, and its time complexity does not 
depend on the matrix dimension. Section [3.2| gives a succinct summary of DCSC. 

Our HyperSparseGEMM uses an outer-product formulation whose time complex- 
ity is 0{nzc{A) + nzr(B) + flops • lg ni), where nzc{A) is the number of columns of 
A that are not entirely zero, nzr(B) is the number of rows of B that are not entirely 
zero, and ni is the number of indices i for which A(:,i) ^ and B(i,:) ^ 0. The 
extra lg ni factor at the time complexity originates from the priority queue that is 
used to merge ni outer products on the fly. The overall memory requirement of this 
algorithm is the asymptotically optimal 0(nnz(A) + nnz(H) + nnz(C)), independent 
of either matrix dimensions or flops. 

3.2. DCSC Data Structure. DCSC [8 is a further compressed version of CSC 
where repetitions in the column pointers array, which arise from empty columns, are 
not allowed. Only columns that have at least one nonzero are represented, together 
with their column indices. 
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Fig. 3.3: Matrix A in Triples format 



Fig. 3.4: Matrix A in DCSC format 



For example, consider the 9-by-9 matrix with 4 nonzeros as in Figure |3.3| Fig- 
ure |3.2| showns its CSC storage, which includes repetitions and redundancies in the 
column pointers array (CP). Our new data structure compresses this column pointers 
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array to avoid repetitions, giving CP of DCSC as in Figure [3T4] DCSC is essentially 
a sparse array of sparse columns, whereas CSC is a dense array of sparse columns. 

After removing repetitions, CP(i) does no longer refer to the «th column. A 
new JC array, which is parallel to CP, gives us the column numbers. Although our 
Hypersparse.GEMM algorithm does not need column indexing, DCSC can support 
fast column indexing by building an AUX array that contains pointers to nonzero 
columns (columns that have at least one nonzero element) in linear time. 

3.3. Sparse SUMMA algorithm. Our parallel algorithm is inspired by the 
dense matrix-matrix multiplication algorithm SUMMA [53] , used in parallel BLAS [TB] . 
SUMMA is memory efficient and easy to generalize to non-square matrices and pro- 
cessor grids. 

The pseudocode of our 2D algorithm, SparseSUMMA [7 , is shown in Algo- 
rithm [2] in its most general form. The coarseness of the algorithm can be adjusted by 
changing the block size 1 < b < gcd(k/p r , k/p c ). For the first time, we present the 
algorithm in a form general enough to handle rectangular processor grids and a wide 
range of blocking parameter choices. The pseudocode, however, requires b to evenly 
divide k/p r and k/p c for ease of presentation. This requirement can be dropped at the 
expense of having potentially multiple broadcasters along a given processor row and 
column during one iteration of the loop starting at line |4j The for ... in parallel do 
construct indicates that all of the do code blocks execute in parallel by all the pro- 
cessors. The execution of the algorithm on a rectangular grid with rectangular sparse 
matrices is illustrated in Figure |3.5| We refer to the Combinatorial BLAS source 
code [2] for additional details. 



Algorithm 2 Operation C «- AB using Sparse SUMMA 

Input: A £ § mxfe j B £ S fcx ™: sparse matrices distributed on a p r x p c processor grid 
Output: C £ § mx ™ : the product AB, similarly distributed, 
l: procedure SparseSUMMA(A, B, C) 



2: for all processors P(i, j) in parallel do 

3: Bjj «- (B, 3 ) 

4: for q = 1 to k/b do > blocking parameter b evenly divides k/p r and k/p c 

5: c = (q ■ b)/p c > c is the broadcasting processor column 

6: r — {q ■ b)/p r > r is the broadcasting processor row 

7: Icols = (q ■ b) mod p c : ((q + 1) • b) mod p c > local column range 

8: Irows = (q ■ b) mod p r : ((q + 1) • b) mod p r > local row range 

9: pjem BROADCAST (A j c ( : , Icols), P(i, :)) 

10: B rem <- BROADCAST(B r j (:, Irows), P(:, j)) 

11: dj <- C y - + HYPERSPARSEGEMM(A rem , W em ) 

12: By <— (B.y) T > Restore the original B 



The BROADCAST(A ic , P(i, :)) syntax means that the owner of A ic becomes the 
root and broadcasts its submatrix to all the processors on the ith processor row. 
Similarly for BROADCAST(B r j, P(:,j)), the owner of B rj broadcasts its submatrix 
to all the processors on the jth processor column. In lines [7}|8j we find the local 
column (for A) and row (for B) ranges for matrices that are to be broadcast during 
that iteration. They are significant only at the broadcasting processors, which can 
be determined implicitly from the first parameter of Broadcast. We index B by 
columns as opposed to rows because it has already been locally transposed in line [3j 
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Fig. 3.5: Execution of the Sparse SUMMA algorithm for sparse matrix-matrix mul- 
tiplication C = A • B. The example shows the first stage of the algorithm execution 
(the broadcast and the local update by processor P(i,j)). The two rectangular sparse 
operands A and B are of sizes m-by-lOOif and lOOK-by-n, distributed on a 5 x 4 
processor grid. Block size b is chosen to be 5K . 



This makes indexing faster since local submatrices are stored in the column-based 
DCSC sparse data structure. Using DCSC, the expected cost of fetching b consecutive 
columns of a matrix A is b plus the size (number of nonzeros) of the output. Therefore, 
the algorithm asymptotically has the same computation cost for all values of b. 

For our complexity analysis, we assume that nonzeros of input sparse matrices 
are independently and identically distributed, input matrices are n-by-n, with d > 
nonzeros per row and column on the average. The sparsity parameter d simplifies our 
analysis by making different terms in the complexity comparable to each other. For 
example, if A and B both have sparsity d, then nnz(A) = dn and flops(AB) = d 2 n. 

The communication cost of the Sparse SUMMA algorithm, for the case of p r = 
Pc = y/P, is 

„ ^/ . / nnz(A.) + nnz("B) N \ „, ^_ Bdn^ 

T comm =VP[ 2a + H — — )) =&{OLy/p+?— ), 3.1 

V P >' y/P 

and its computation cost is 

„, ^(dn d 2 n , , d 2 n . d 2 nlgJp\ ^fdn d 2 n , ,d 2 n.\ _ ,„ . 
T amp = 0l— + lg (— + =0 (— + lg 0. (3.2) 

\y/p p ^Py/P' P ' \y/p P P i 

We see that although scalability is not perfect and efficiency deteriorates as p 
increases, the achievable speedup is not bounded. Since \g(d 2 n/p) becomes negligible 
as p increases, the bottlenecks for scalability are the fidnj^Jp term of T comm and 
the dn/^fp term of T comp , which scale with ^fp. Consequently, two different scaling 
regimes are likely to be present: A close to linear scaling regime until those terms 
start to dominate and a Y/p-scaling regime afterwards. 

4. Sparse matrix indexing and subgraph selection. Given a sparse matrix 
A and two vectors I and J of indices, SpRef extracts a submatrix and stores it as 
another sparse matrix, B = A(l, J). Matrix B contains the elements in rows and 
columns J(j) of A, for i = 1, len(\) and j = 1, Zen(J), respecting the order of 
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indices. If A is the adjacency matrix of a graph, SpRcf(A, I, I) selects an induced 
subgraph. SpRef can also be used to randomly permute the rows and columns of a 
sparse matrix, a primitive in parallel matrix computations commonly used for load 
balancing |31j . 

Simple cases such as row (A(i, :)), column (A(:, i)), and element (A(i,j)) indexing 
are often handled by special purpose subroutines [11] , A parallel algorithm for the 
general case, where I and J are arbitrary vectors of indices, does not exist in the 
literature. We propose an algorithm that uses parallel SpGEMM. Our algorithm is 
amenable to performance analysis for the general case. 

A related kernel is SpAsgn, or sparse matrix assignment. This operation assigns 
a sparse matrix to a submatrix of another sparse matrix, A (I, J) = B. A variation of 
SpAsgn is A(l, J) = A(l, J) + B, which is similar to Liu's extend-add operation 30 
in finite element matrix assembly. Here we describe the sequential SpAsgn algorithm 
and its analysis, and report large-scale performance results in Section [5.2| 
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Fig. 4.1: Sparse matrix indexing (SpRef) using mixed-mode SpGEMM. On an m- 
by-n matrix A, the SpRef operation A(I,J) extracts a len(\)-by-len(J) submatrix, 
where I is a vector of row indices and J is a vector of column indices. The example 
shows B = A([2, 4], [1, 2, 3]). It performs two SpGEMM operations between a boolean 
matrix and a general-type matrix. 



4.1. Sequential algorithms for SpRef and SpAsgn. Performing SpRef by 
a triple sparse-matrix product is illustrated in Figure |4.1| The algorithm can be 
described concisely in Matlab notation as follows: 

i function B = spref (A, I, J) 

2 

3 [m, n] = size (A) ; 

4 R - spar se ( 1 : len ( I ) , I , 1 , len ( I ) , m) ; 

5 Q = sparse ( J, 1 : len ( J) , 1 , n, len (J) ) ; 

6 B - R*A*Q; 

The sequential complexity of this algorithm is flops(R • A) + flops((RA) • Q). 
Due to the special structure of the permutation matrices, the number of nonzero 
operations required to form the product R • A is equal to the number of nonzero 
elements in the product. That is, flops(R • A) = nnz(RA) < nnz(A). Similarly, 
flops((RA) • Q) < nnz(A), making the overall complexity 0(nnz(A)) for any I and J. 
This is optimal in general, since just writing down the result of a matrix permutation 
B = A(r, r) requires Q(nnz(A)) operations. 

Performing SpAsgn by two triple sparse-matrix products and additions is illus- 
trated in Figure |4~2} We create two temporary sparse matrices of the same dimensions 
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/ \ / \ 
A = A+ B - A(I,J) 

\ / \ 0/ 



Fig. 4.2: Illustration of SpAsgn (A(I,J) = B) for rings where additive inverses are 
defined. For simplicity, the vector indices I and J are shown as contiguous, but they 
need not be. 



as A. These matrices contain nonzeros only for the A(l, J) part, and zeros elsewhere. 
The first triple product embeds B into a bigger sparse matrix that we add to A. The 
second triple product embeds A (I, J) into an identically sized sparse matrix so that 
we can zero out the A(I,J) portion by subtracting it from A. Since general semir- 
ing axioms do not require additive inverses to exist, we implement this piece of the 
algorithm slightly differently that stated in the pseudocode. We still form the SAT 
product but instead of using subtraction, we use the generalized sparse elementwise 
multiplication function of the Combinatorial BLAS [10] to zero out the A (I, J) portion. 
In particular, we first perform an elementwise multiplication of A with the negation 
of SAT without explicitly forming the negated matrix, which can be dense. Thanks 
to this direct support for the implicit negation operation, the complexity bounds are 
identical to the version that uses subtraction. The negation does not assume additive 
inverses: it sets all zero entries to one and all nonzeros entries to zero. The algorithm 
can be described concisely in Matlab notation as follows: 

1 function C = spasgn (A, I , J, B) 

2 % A = spasgn (A, I, J, B) performs A(I,J) = B 

3 

4 [ma, na] = size (A) ; 

5 [nib, nb] = size(B); 

6 R = spar se ( I , 1 : mb, 1 , ma, mb) ; 

7 Q = spar se ( 1 : nb, J, 1 , nb, na) ; 

8 S = spar se ( I , I , 1 , ma, ma) ; 

9 T = sparse ( J, J, 1, na, na) ; 
10 C = A + R*B*Q - S*A*T; 

Liu's extend-add operation is similar to SpAsgn but simpler; it just omits sub- 
tracting the SAT term. 

Let us analyze the complexity of SpAsgn. Given A e § mxn and B G § ien W xien (-0, 
the intermediate boolean matrices have the following properties: 

R is m-by-Zen(l) rectangular with len{\) nonzeros, one in each column. 

Q is Zen(J)-by-n rectangular with len(X) nonzeros, one in each row. 

S is TO-by-rn symmetric with len{\) nonzeros, all located along the diagonal. 

T is n-by-n symmetric with len{\) nonzeros, all located along the diagonal. 

Theorem 4.1. The sequential SpAsgn algorithm takes 0(nnz(A) + nnz(B) + 
len(\) + len{])) time using an optimal 0(flops) SpGEMM subroutine. 

Proof. The product R- B requires flops(R- B) = nnz(RH) = nnz(B) operations 
because there is a one-to-one relationship between nonzeros in the output and flops 
performed. Similarly, flops((RB) • Q) = rmz(RBQ) = nnz(B), yielding 0(rmz(B)) 
complexity for the first triple product. The product S • A only requires len(\) flops 
since it does not need to touch nonzeros of A that do not contribute to A(l, :). Simi- 
larly, (SA) • T requires only len(J) flops. The number of nonzeros in the second triple 
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Fig. 4.3: Parallel forming of the left hand side boolean matrix R from the index vector 
I on 9 processors in a logical 3x3 grid. R will be subsequently multiplied with A to 
extract 6 rows out of 9 from A and order them as {7, 2, 5, 8, 1, 3}. 



product is nnz(SAT) = 0(len{\) + Zera(J)). The final pointwise addition and subtrac- 
tion (or generalized elementwise multiplication in the absence of additive inverses) 
operations take time on the order of the total number of nonzeros in all operands , 
which is 0(nnz{A) + wiz(B) + len(\) + len(S)). □ 

4.2. SpRef in parallel. The parallelization of SpRef poses several challenges. 
The boolean matrices have only one nonzero per row or column. For the parallel 
2D algorithm to scale well with increasing number of processors, data structures and 
algorithms should respect hypersparsity [5] • Communication should ideally take place 
along a single processor dimension, to save a factor of ^Jp in communication volume. 
As before, we assume a uniform distribution of nonzeros to processors in our analysis. 

The communication cost of forming the R matrix in parallel is the cost of Scatter 
along the processor column. For the case of vector I distributed to y/p diagonal 
processors, scattering can be implemented with an average communi catio n cost of 



0(a • \gp + (3 ■ (len(\)/y/p) [H]. This process is illustrated in Figure 4.3 The Q T 
matrix can be constructed identically, followed by a TRANSPOSE (Q T ) operation where 
each processor P(i,j) receives nnz(Q)/p = len(J)/p words of data from its diagonal 
neighbor P(j, i). Note that the communication cost of the transposition is dominated 
by the cost of forming Q T via Scatter. 

While the analysis of our parallel SpRef algorithm assumes that the index vectors 
are distributed only on diagonal processors, the asymptotic costs are identical in the 
2D case where the vectors are distributed across all the processors [T3] . This is because 
the number of elements (the amount of data) received by a given processor stays the 
same with the only difference in the algorithm being the use of Alltoall operation 
instead of Scatter during the formation of the R and Q matrices. 

The parallel performance of SpGEMM is a complicated function of the matrix 
nonzero structures [7J For SpRef, however, the special structure makes our analysis 
more precise. Suppose that the triple product is evaluated from left to right, B = 
(R • A) • Q; a similar analysis can be applied to the reverse evaluation. A conservative 
estimate of m(R, A), the number of indices i for which R(:, i) ^ and A(i, :) ^ 0, is 
nnz(TV) = len(\). 

Using our HyperSparseGEMM jHj[5] as the computational kernel, time to com- 
pute the product RA (excluding the communication costs) is: 
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Tmuit = max ( nzc(R ik ) + nzr(A kj ) + nops(R i/c • A kj ) ■ lg ni(R ik , A kj )\ 



4 ,7 

k = l 



where the maximum over all pairs is equal to the average, due to the uniform 
nonzero distribution assumption. 

Recall from the sequential analysis that flops(R- A) < nnz(A) since each nonzero 
in A contributes at most once to the overall flop count. We also know that nzc(R) = 
len(\) and nzr(A) < nnz(A). Together with the uniformity assumption, these identi- 
ties yield the following results: 

. nnz(A) 
flops(R jfe • A kj ) = — , 

pVp 

, . , . len(\) 

ni(R ik ■ A kj ) < nnz(R lk ) = — — , 

E, . . . len(\) 

nzc(R ik ) = nzc(R i: ) = — — , 

k=i » P 

VP 



nzr{A ik ) = nzr(A i: ) < 



nnz(A) 



fx ' VP 



In addition to the multiplication costs, adding intermediate triples in ^/p stages 
costs an extra flops(R; : • A : j)\gy/p = (nnz(A)/p)lg^/p operations per processor. 
Thus, we have the following estimates of computation and communication costs for 
computing the product RA: 

, „ / len(\) + nnz(A) nnz(A) Aen(\) ^ 

T comp (R • A = Of W V ' + ^ • lg(— ^ + VP 

v Vp p p 

T comm (R • A) = ft(a ■ y/p + P ■ H^l). 

VP 

Given that nnz (RA) < nnz(A), the analysis of multiplying the intermediate 
product RA with Q is similar. Combined with the cost of forming auxiliary matrices 
R and Q and the costs of transposition of Q T , the total cost of the parallel SpRcf 
algorithm becomes 



T = Oi 

' comp y I 



T — ft 

± comm ^ 



/ len(\) + fen(J) + nnz(A) nnz(A) ^ , len{\) + len(J) ,\ 
v */r> v v / 

[a ■ y/p + p 



IP P 
nnz(A) + len{\) + len(S) 



We see that SpGEMM costs dominate the cost of SpRef. The asymptotic speedup 
is limited to O(Vp), as in the case of SpGEMM. 

5. Experimental Results. We ran experiments on NERSCs Franklin sys- 
tem [1], a 9660-node Cray XT4. Each XT4 node contains a quad-core 2.3 GHz AMD 
Opteron processor, attached to the XT4 interconnect via a Cray SeaStar2 ASIC using 
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Fig. 5.1: Performance and parallel scal- 
ing of applying a random symmetric per- 
mutation to an R-MAT matrix of scale 
22. The x-axis uses a log scale. 



Fig. 5.2: Performance and parallel scal- 
ing of extracting 10 induced subgraphs 
from an R-MAT matrix of scale 22. The 
x-axis uses a log scale. 



a Hyper Transport 2 interface capable of 6.4 GB/s. The SeaStar2 routing ASICs are 
connected in a 3D torus topology, and each link is capable of 7.6 GB/s peak bidirec- 
tional bandwidth. Our algorithms perform similarly well on a fat tree topology, as 
evidenced by our experimental results on the Ranger platform that are included in 
an earlier technical report [S]. 

We used the GNU C/C++ compilers (version 4.5), and Cray's MPI implementa- 
tion, which is based on MPICH2. We incorporated our code into the Combinatorial 
BLAS framework [10]. We experimented with core counts that are perfect squares, 
because the Combinatorial BLAS currently uses a square y/p x ^fp processor grid. We 
compared performance with the Trilinos package (version 10.6.2.0) |2S], which uses a 
ID decomposition for its sparse matrices. 

In the majority of our experiments, we used synthetically generated R-MAT ma- 
trices rather than Erdos-Renyi [55] "fiat" random matrices, as these are more realistic 
for many graph analysis applications. R-MAT [T3], the Recursive MATrix generator, 
generates graphs with skewed degree distributions that approximate a power-law. A 
scale n R-MAT matrix is 2™-by-2" . Our R-MAT matrices have an average of 8 nonze- 
ros per row and column. R-MAT seed paratemeters are a = .6, and b = c = d = A/3. 
We applied a random symmetric permutation to the input matrices to balance the 
memory and the computational load. In other words, instead of storing and comput- 
ing C = AB, we compute PCP T = (PAP T )(PBP T ). All of our experiments are 
performed on double-precision floating-point inputs. 

Since algebraic multigrid on graphs coming from physical problems is an im- 
portant case, we included two more matrices from the Florida Sparse Matrix col- 



lection |21j to our experimental analysis, into Section 5.3.2 where we benchmark 
restriction operation that is used in algebraic multigrid. The first such matrix is a 
large circuit problem (Freescalel) with 17 million nonzeros and 3.42 million rows and 
columns. The second matrix comes from a structural problem (GHS_psdef/ldoor), 
and has 42.5 million nonzeros and 952, 203 rows and columns. 

5.1. Parallel Scaling of SpRef. Our first set of experiments randomly per- 
mutes the rows and columns of A, as an example case study for matrix reordering 
and partitioning. This operation corresponds to relabeling vertices of a graph. Our 
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Fig. 5.3: Observed scaling of the SpAsgn operation A(l, I) = B where A 6 K™ x ™ is an 
R-MAT matrix of scale 22 and B is another R-MAT matrix whose scale is shown in 
the figure legend. I is a duplicate-free sequence with entries randomly selected from 
the range l...n; its length matches the dimensions of B. Both axes are log scale. 



second set of experiments explores subgraph extraction by generating a random per- 
mutation of 1 : n and dividing it into k n chunks r\, . . . , r^. We then performed k 
SpRef operations of the form A(rj, r^), one after another (with a barrier in between). 
In both cases, the sequential reference is our algorithm itself. 

The performance and parallel scaling of the symmetric random permutation is 



shown in Figure 5.1 The input is an R-MAT matrix of scale 22 with approximately 
32 million nonzeros in a square matrix of dimension 2 22 . Speedup and runtime are 
plotted on different vertical axes. We see that scaling is close to linear up to about 
64 processors, and proportional to y/p afterwards, agreeing with our analysis. 

The performance of subgraph extraction for k = 10 induced subgraphs, each with 



n/k randomly chosen vertices, is shown in Figure 5.2 The algorithm performs well 
in this case too. The observed scaling is slightly less than the case of applying a 
single big permutation, which is to be expected since the multiple small subgraph 
extractions increase span and decrease available parallelism. 

5.2. Parallel Scaling of SpAsgn. We benchmarked our parallel SpAsgn code 
by replacing a portion of the input matrix (A) with a structurally similar right-hand 
side matrix (B). This operation is akin to replacing a portion of the graph due to a 
streaming update. The subset of vertices (row and column indices of A) to be updated 
is chosen randomly. In all the tests, the original graph is an R-MAT matrix of scale 
22 with 32 million nonzeros. The right-hand side (replacement) matrix is also an R- 
MAT matrix of scales 21, 20, and 19, in three subsequent experiments, replacing 50%, 
25%, and 12.5% of the original graph, respectively. The average number of nonzeros 
per row and column are also adjusted for the right hand side matrices to match the 
nonzero density of the subgraphs they are replacing. 

The performance of this sparse matrix assignment operation is shown in Fig- 



ure 5.3 Our implementation uses a small number of Combinatorial BLAS routines: 
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A sparse matrix constructor from distributed vectors, essentially a parallel version 
of Matlab's sparse routine, the generalized elementwise multiplication with direct 
support for negation, and parallel SpGEMM implemented using Sparse SUMMA. 

5.3. Parallel Scaling of Sparse SUMMA. Wc implemented two versions of 
the 2D parallel SpGEMM algorithms in C++ using MPI. The first is directly based 
on Sparse SUMMA and is synchronous in nature, using blocking broadcasts. The 
second is asynchronous and uses one-sided communication in MPI-2. We found the 
asynchronous implementation to be consistently slower than the broadcast-based syn- 
chronous implementation due to inefficient implementation of one-sided communica- 
tion routines in MPI. Therefore, we only report the performance of the synchronous 
implementation. The motivation behind the asynchronous approach, performance 
comparisons, and implementation details, can be found in our technical report [21 
Section 7]. On more than 4 cores of Franklin, synchronous implementation consis- 
tently outperformed the asynchronous implementation by 38-57%. 

Our sequential HyperSparseGEMM routines return a set of intermediate triples 
that are kept in memory up to a certain threshold without being merged immedi- 
ately. This permits more balanced merging, eliminating some unnecessary scans that 
degraded performance in a preliminary implementation [7J. 

5.3.1. Square Sparse Matrix Multiplication. In the first set of experiments, 
we multiply two structurally similar R-MAT matrices. This square multiplication is 
representative of the expansion operation used in the Markov clustering algorithm [36 . 
It is also a challenging case for our implementation due to the highly skewed nonzero 
distribution. We performed strong scaling experiments for matrix dimensions ranging 
from 2 21 to 2 24 . 

Figure pT4| shows the speedup we achieved. The graph shows linear speedup until 
around 100 processors; afterwards the speedup is proportional to the square root 
of the number of processors. Both results agree with the theoretical analysis. To 
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523.7 


23 


65.8 


65.8 


504.3 


1021.3 


24 


132.1 


132.1 


1056.9 


2137.4 



Fig. 5.5: Statistics about R-MAT prod- 
uct C = AB. All numbers (except scale) 
are in millions. 




Number of Cores 



Fig. 5.6: Demonstration of two scaling 
regimes for scale 21 R-MAT product. 



illustrate how the scaling transitions from linear to y/p, we drew trend lines on the 



scale 21 results. As shown in Figure 5.6 the slope of the log-log curve is 0.85 (close 



to linear) until 121 cores, and the slope afterwards is 0.47 (close to y/p). Figure 5.7 
zooms to the linear speedup regime, and shows the performance of our algorithm at 
lower concurrencies. The speedup and timings are plotted on different y-axes of the 
same graph. 

Our implementation of Sparse SUMMA achieves over 2 billion "useful flops" (in 
double precision) per second on 8100 cores when multiplying scale 24 R-MAT matrices. 
Since useful flops are highly dependent on the matrix structure and sparsity, we 
provide additional statistics for this operation in Table [575] Using matrices with more 
nonzeros per row and column will certainly yield higher performance rates (in useful 
flops). The gains from sparsity are clear if one considers dense flops that would be 
needed if these matrices were stored in a dense format. For example, multiplying two 
dense scale 24 matrices requires 9444 exaflops. 

Figure [5~8| breaks down the time spent in communication and computation when 
multiplying two R-MAT graphs of scale 24. We see that computation scales much 
better than communication (over 90x reduction when going from 36 to 8100 cores), 
implying that SpGEMM is communication bound for large concurrencies. For exam- 
ple, on 8100 cores, 83% of the time is spent in communication. Communication times 
include the overheads due to synchronization and load imbalance. 

Figure |5.8| also shows the effect of different blocking sizes. Remember that each 
processor owns a submatrix of size n/ y/p-by-n/ y/p. On the left, the algorithm com- 
pletes in y/p stages, each time broadcasting its whole local matrix. On the right, the 
algorithm completes in 2y/p stages, each time broadcasting half of its local matrix. 
We see that while communication costs are not affected, the computation slows down 
by 1-6% when doubling the number of stages. This difference is due to the costs of 
splitting the input matrices before the multiplication and reassembling them after- 
wards, which is small because splitting and reassembling are simple scans over the 
data whose costs are dominated by the cost of multiplication itself. 

5.3.2. Multiplication with the Restriction Operator. Multilevel methods 
are widely used in the solution of numerical and combinatorial problems [35] . Such 
methods construct smaller problems by successive coarsening. The simplest coars- 
ening is graph contraction: a contraction step chooses two or more vertices in the 
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Fig. 5.7: Performance and scaling of 
Sparse SUMMA at lower concurrencies 
(scale 21 inputs). The x-axis uses a log 
scale. 



Fig. 5.8: Communication and computa- 
tion breakdown, at various concurren- 
cies and two blocking sizes (scale 24 in- 
puts). 



original graph G to become a single aggregate vertex in the contracted graph G . 
The edges of G that used to be incident to any of the vertices forming the aggregate 
become incident to the new aggregate vertex in G' . 

Constructing coarse grid during the V-cycle of algebraic multigrid [5 or graph 
partitioning [27] is a generalized graph contraction operation. Different algorithms 
need different coarsening operators. For example, a weighted aggregation |15j might 
be preferred for partitioning problems. In general, coarsening can be represented as 
multiplication of the matrix representing the original fine domain (grid, graph, or 
hypergraph) by the restriction operator. 

In these experiments, we use a simple restriction operation to perform graph 
contraction. Gilbert et al. describe how to perform contraction using SpGEMM. 
Their algorithm creates a special sparse matrix S with n nonzeros. The triple product 
SAS T contracts the whole graph at once. Making S smaller in the first dimension 
while keeping the number of nonzeros same changes the restriction order. For example, 
we contract the graph into half by using S having dimensions n/2 x n, which is said 
to be of order 2. 



Figure 5.9 shows 'strong scaling' of AS T operation for R-MAT graphs of scale 23. 
We used restrictions of order 2, 4, and 8. Changing the interpolation order results 
in minor (less than 5%) changes in performance, as shown by the overlapping curves. 
This is further evidence that our algorithm's complexity is independent of the matrix 
dimension, because interpolation order has a profound effect on the dimension of 
the right hand side matrix, but it does not change the expected flops and numbers 
of nonzeros in the inputs (it may slightly decrease the number of nonzeros in the 
output). The experiment shows scaling up to 4096 processors. Figure 5.10 shows 
the breakdown of time (as percentages) spent on remote communication and local 
computation steps. 

Figures |5.11a| and |5.11b| show 'strong scaling' of the full restriction operation 
SAS T of order 8, using different parenthesizations for the triple product. The re- 
sults show that our code achieves llOx speedup on 1024-way concurrency and 163x 
speedup on 4096-way concurrency, and the performance is not affected by the different 
parenthesizations. 

Figure 5.12 shows the performance of full operation on real matrices from phys- 
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Fig. 5.9: Strong scaling of B AS T , 
multiplying scale 23 R-MAT matrices 
with the restriction operator on the right. 
The x-axis uses a log scale. 
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Fig. 5.10: Normalized communication 
and computation breakdown for multi- 
plying scale 23 R-MAT matrices with the 
restriction operator of order 4. 
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(a) Left to right evaluation: (SA)S T (b) Right to left evaluation: S(AS T ) 

Fig. 5.11: The full restriction operation of order 8 applied to a scale 23 R-MAT matrix. 



ical problems. Both matrices have a full diagonal that remains full after symmetric 
permutation. Due to the 2D decomposition, processors responsible for the diagonal 
blocks typically have more work to do. For load-balancing and performance reasons, 
we split these matrices into two pieces A = D + L where D is the diagonal piece 
and L is the off-diagonal piece. The restriction of rows becomes SA = SD + SL. 
Scaling the columns of S with the diagonal of D performs the former multiplication, 
and the latter multiplication uses Sparse SUMMA algorithm described in our paper. 
This splitting approach especially improved the scalability of restriction on Freescalel 
matrix, because it is much sparser that GHS_psdef/ldoor, which does not face severe 
load balancing issues. Order 2 restriction shrinks the number of nonzcros from 17.0 
to 15.3 million for Freescalel, and from 42.5 to 42.0 million for GHS_psdef/ldoor. 

5.3.3. Tall Skinny Right Hand Side Matrix. The last set of experiments 
multiplies R-MAT matrices by tall skinny matrices of varying sparsity. This compu- 
tation is representative of the parallel breadth-first search that lies at the heart of our 
distributed- memory betweenness centrality implementation [10] . This set indirectly 
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Number of Cores Number of Cores 

(a) Freescalel matrix. (b) GHS_psdcf/ldoor matrix. 



Fig. 5.12: Performance and strong scaling of Sparse SUMMA implementation for the 
full restriction of order 2 (SAS T ) on real matrices from physical problems. 



examines the sensitivity to sparsity as well, because we vary the sparsity of the right 
hand side matrix from approximately 1 to 10 5 nonzeros per column, in powers of 10. 
In this way, we imitate the patterns of the level-synchronous breadth-first search from 
multiple source vertices where the current frontier can range from a few vertices to 
hundreds of thousands |12) . 

For our experiments, the R-MAT matrices on the left hand side have d\ = 8 
nonzeros per column and their dimensions vary from n = 2 20 to n = 2 26 . The right- 
hand side is an Erdos-Renyi matrix of dimensions n-by-fc, and the number of nonzeros 
per column, d%, is varied from 1 to 10 , in powers of 10. The right-hand matrix's width 
k varies from 128 to 8192, growing proportionally to its length n, hence keeping the 
matrix aspect ratio constant at n/k — 8192. Except for the d 2 = 10 5 case, the R-MAT 
matrix has more nonzeros than the right-hand matrix. In this computation, the total 
work is W — 0(did 2 k), the total memory consumption is M = 0(d\n + d 2 k), and 
the total bandwidth requirement is 0{M-Jp). 

We performed weak scaling experiments where memory consumption per proces- 
sor is constant. Since M — 0{din + d^k), this is achieved by keeping both n/p = 2 14 
and k/p = 2 constant. Work per processor is also constant. However, per-processor 
bandwidth requirements of this algorithm increases by a factor of y/p. 

Figure [5. 1 3| shows a performance graph in three dimensions. The timings for each 
slice along the XZ-plane (i.e. for every d 2 = {1, 10, 10 5 } contour) are normalized 
to the running time on 64 processors. We do not cross-compare the absolute perfor- 
mances for different d 2 values, as our focus in this section is parallel scaling. In line 
with the theory, we observe the expected ^/p slowdown due to communication costs. 

The performance we achieved for these large scale experiments, where we ran our 
code on up to 4096 processors, is remarkable. It also shows that our implementation 
does not incur any significant overheads since it does not deviate from the Jp~ curve. 

5.4. Comparison with Trilinos. The EpetraExt package of Trilinos can mul- 
tiply two distributed sparse matrices in parallel. Trilinos can also permute matrices 
and extract submatrices through its EpetraTmport and Epetra_Export classes. These 
packages of Trilinos use a ID data layout. 

For SpGEMM, we compared the performance of Trilinos's EpetraExt package with 
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Fig. 5.13: Weak scaling of R-MAT times a tall skinny Erdos-Renyi matrix. X (pro- 
cessors) and Y (nonzeros per column on fringe) axes are logarithmic, whereas Z (nor- 
malized time) axis is linear. 



ours on two scenarios. In the first scenario, we multiplied two R-MAT matrices as 
described in Section [5.3.1| and in the second scenario, we multiplied an R-MAT matrix 
with the restriction operator of order 8 on the right as described in Section [5. 3. 2| 

Trilinos ran out of memory when multiplying R-MAT matrices of scale larger than 
21, or when using more than 256 processors. Figure [5.14a| shows SpGEMM timings 
for up to 256 processors on scale 21 data. Sparse SUMMA is consistently faster 
than Trilinos's implementation, with the gap increasing with the processor count, 
reaching 66 x on 256- way concurrency. Sparse SUMMA is also more memory efficient 
as Trilinos's matrix multiplication ran out of memory for p = 1 and p — 4 cores. 
The sweet spot for Trilinos seems to be around 120 cores, after which its performance 
degrades significantly. 

In the case of multiplying with the restriction operator, the speed and scalabil- 
ity of our implementation over EpetraExt is even more pronounced. This is shown 
in Figure |5.14b| where our code is 65X faster even on just 121 processors. Remark- 
ably, our codes scales up to 4096 cores on this problem, as shown in Section |5.3.2| 
while EpetraExt starts to slow down just beyond 16 cores. We also compared Sparse 
SUMMA with EpetraExt on matrices coming from physical pr oblem s, and the results 
for the full restriction operation (SAS T ) are shown in Figures 



5.15 



In order to benchmark Trilinos's sparse matrix indexing capabilities, we used Epe- 
traExt 's permutation class that can permute row or columns of an Epetra_CrsMatrix 
by creating a map defined by the permutation, followed by an Epetra_Export oper- 
ation to move data from the input object into the permuted object. We applied a 
random symmetric permutation on a R-MAT matrix, as done in Section [5.1| Trilinos 
shows good scaling up to 121 cores but then it starts slowing down as concurrency 
increases, eventually becoming over 10 x slower than our SpRef implementation at 
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(a) R-MAT X R-MAT product (scale 21). 



(b) Multiplication of an R-MAT matrix of scale 
23 with the restriction operator of order 8. 



Fig. 5.14: Comparison of SpGEMM implementation of Trilinos's EpetraExt package 
with our Sparse SUMMA implementation using synthetically generated matrices. The 
data labels on the plots show the speedup of Sparse SUMMA over EpetraExt. 
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Fig. 5.15: Comparison of Trilinos's EpetraExt package with our Sparse SUMMA 
implementation for the full restriction of order 2 (SAS T ) on real matrices. The data 
labels on the plots show the speedup of Sparse SUMMA over EpetraExt. 



169 cores. 

6. Conclusions and Future Work. We presented a flexible parallel sparse 
matrix-matrix multiplication (SpGEMM) algorithm, Sparse SUMMA, which scales 
to thousands of processors in distributed memory. We used Sparse SUMMA as a 
building block to design and implement scalable parallel routines for sparse matrix 
indexing (SpRef) and assignment (SpAsgn). These operations are important in the 
context of graph operations. They yield elegant algorithms for coarsening graphs by 
edge contraction as in Figure [6?T] extracting subgraphs, performing parallel breadth- 
first search from multiple source vertices, and performing batch updates to a graph. 

We performed parallel complexity analyses of our primitives. In particular, us- 



20 



BULUg AND GILBERT 



A1 




Fig. 6.1: Example of graph coarsening using edge contraction, which can be imple- 
mented via a triple sparse matrix product SAS T where S is the restriction operator. 



ing SpGEMM as a building block enabled the most general analysis of SpRef. Our 
extensive experiments confirmed that our implementation achieves the performance 
predicted by our analyses. 

Our SpGEMM routine might be extended to handle matrix chain products. In 
particular, the sparse matrix triple product is used in the coarsening phase of algebraic 
multigrid [SJ. Sparse matrix indexing and parallel graph contraction also require 
sparse matrix triple products [25] . Providing a first-class primitive for sparse matrix 
chain products would eliminate temporary intermediate products and allow more 
optimization, such as performing structure prediction [17] and determining the best 
order of multiplication based on the sparsity structure of the matrices. 

As we show in Section 5.3 our implementation spends more than 75% of its time 
in inter-node communication after 2000 processors. Scaling to higher concurrencies 
require asymptotic reductions in communication volume. We are working on develop- 
ing practical communication-avoiding algorithms [4] for sparse matrix-matrix multi- 
plication (and consequently for sparse matrix indexing and assignment), which might 
require inventing efficient novel sparse data structures to support such algorithms. 

Our preliminary experiments suggest that synchronous algorithms for SpGEMM 
cause considerably higher load imbalance than asynchronous ones [9, Section 7]. In 
particular, a truly one-sided implementation can perform up to 46% faster when mul- 
tiplying two R-MAT matrices of scale 20 using 4000 processors. We will experiment 
with partitioned global address space (PGAS) languages, such as UPC [TH], because 
the current implementations of one-sided MPI-2 were not able to deliver satisfactory 
performance when used to implement asynchronous versions of our algorithms. 

As the number of cores per node increases due to multicore scaling, so does the 
contention on the network interface card. Without hierarchical parallelism that ex- 
ploits the faster on-chip network, the flat MPI parallelism will be unscalable because 
more processes will be competing for the same network link. Therefore, designing hi- 
erarchically parallel SpGEMM and SpRef algorithms is an important future direction. 
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